Alibaba Cloud's Spark Shuffle Optimizations
Introduction to Spark Shuffle
Smart Shuffle design
Performance analysis
Spark Shuffle workflow
Spark 0.8 and earlier: Hash Based Shuffle
Spark 0.8.1: File Consolidation mechanism introduced for Hash Based Shuffle
Spark 0.9: ExternalAppendOnlyMap introduced
Spark 1.1: Sort Based Shuffle introduced, but Hash Based Shuffle remains the default
Spark 1.2: the default shuffle changes to Sort Based Shuffle
Spark 1.4: Tungsten-Sort Based Shuffle introduced
Spark 1.6: Tungsten-sort merged into Sort Based Shuffle
Spark 2.0: Hash Based Shuffle is retired
Spark Shuffle implementation
Introduction to sort-based shuffle
// Let the user specify short names for shuffle managers
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
// Read the shuffle manager type; "sort" is the default
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
// Fall back to treating the configured value as a fully qualified class name
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
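The fragment above depends on Spark internals (`conf`, `instantiateClass`), so it cannot run on its own. The following is a minimal standalone sketch of just the name-resolution step. The object name `ShuffleManagerNames` and the method `resolve` are hypothetical and are not part of Spark. The class `com.example.MyShuffleManager` is a made-up name used only to show the fallback path.

```scala
// Standalone sketch of the short-name lookup; not Spark's actual code.
object ShuffleManagerNames {
  // Short names mapped to fully qualified class names, as in the fragment above
  val shortNames: Map[String, String] = Map(
    "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

  // configured: the value of spark.shuffle.manager, if the user set one
  def resolve(configured: Option[String]): String = {
    val name = configured.getOrElse("sort") // "sort" is the default
    // A known short name maps to its class; anything else is kept unchanged
    // and treated as a fully qualified class name.
    shortNames.getOrElse(name.toLowerCase, name)
  }

  def main(args: Array[String]): Unit = {
    println(resolve(None))                                 // default path
    println(resolve(Some("HASH")))                         // short names are case-insensitive
    println(resolve(Some("com.example.MyShuffleManager"))) // hypothetical custom class
  }
}
```

Because unknown values are passed through unchanged, a custom shuffle manager can be selected by giving its full class name in `spark.shuffle.manager`.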
Problems with Spark shuffle
Smart Shuffle
Using Smart Shuffle
Set spark.shuffle.manager to org.apache.spark.shuffle.hash.HashShuffleManager
Set spark.shuffle.smart.spill.memorySizeForceSpillThreshold: controls how much memory shuffle data may occupy; the default is 128M
Set spark.shuffle.smart.transfer.blockSize: controls the size of the data blocks that shuffle transfers over the network
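As a rough illustration of how the three settings above might be passed to a job, the sketch below renders them as `spark-submit --conf key=value` arguments. The object `SmartShuffleSettings` and its methods are hypothetical helpers. The value format `128m` is an assumption that mirrors the stated 128M default, and the block size `1m` is a placeholder, not a recommended or documented value.

```scala
// Hypothetical helper that renders Smart Shuffle settings as spark-submit arguments.
object SmartShuffleSettings {
  // Keys and the manager class name are taken from the list above;
  // the two size values are supplied by the caller.
  def settings(spillThreshold: String, blockSize: String): Seq[(String, String)] = Seq(
    "spark.shuffle.manager" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
    "spark.shuffle.smart.spill.memorySizeForceSpillThreshold" -> spillThreshold,
    "spark.shuffle.smart.transfer.blockSize" -> blockSize)

  // Each pair becomes two arguments: "--conf" followed by "key=value"
  def toSubmitArgs(kvs: Seq[(String, String)]): Seq[String] =
    kvs.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit = {
    // "128m" assumes the usual Spark size-string format; "1m" is a placeholder value.
    println(toSubmitArgs(settings("128m", "1m")).mkString(" "))
  }
}
```

The same key/value pairs could instead go into `spark-defaults.conf` or be set on a `SparkConf`; which route applies depends on how the cluster is managed.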
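Performance analysis
Smart Shuffle did not slow down any individual query
Individual queries ran up to 2x faster
Q2 performed the same under both shuffle implementations
Q49 was significantly faster under Smart Shuffle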